Fair Evaluation of Global Network Aligners
Biological network alignment identifies topologically and functionally
conserved regions between networks of different species. It encompasses two
algorithmic steps: node cost function (NCF), which measures similarities
between nodes in different networks, and alignment strategy (AS), which uses
these similarities to rapidly identify high-scoring alignments. Different
methods use both different NCFs and different ASs. Thus, it is unclear whether
the superiority of a method comes from its NCF, its AS, or both. We already
showed on MI-GRAAL and IsoRankN that combining NCF of one method and AS of
another method can lead to a new superior method. Here, we evaluate MI-GRAAL
against the newer GHOST method, aiming to further improve alignment quality. Also, we
approach several important questions that have not been asked systematically
thus far. First, we ask how much of the node similarity information in NCF
should come from sequence data compared to topology data. Existing methods
determine this more or less arbitrarily, which could affect the resulting
alignment(s). Second, when topology is used in NCF, we ask how large the size
of the neighborhoods of the compared nodes should be. Existing methods assume
that larger neighborhood sizes are better.
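The NCF/AS decomposition is what makes such mix-and-match evaluation possible: an aligner is simply an AS applied to the node-similarity scores that an NCF produces, so the components of different methods can be recombined freely. Below is a minimal sketch of this modularity, using a toy degree-based NCF and a greedy AS; both are hypothetical stand-ins, not MI-GRAAL's or GHOST's actual components.

```python
# A minimal sketch of NCF/AS modularity (hypothetical toy components,
# not MI-GRAAL's or GHOST's actual NCF or AS).
from itertools import product

def ncf_degree(g1, g2):
    """Toy topology-based NCF: similarity from relative degree difference."""
    deg1 = {u: len(nbrs) for u, nbrs in g1.items()}
    deg2 = {v: len(nbrs) for v, nbrs in g2.items()}
    return {(u, v): 1.0 - abs(deg1[u] - deg2[v]) / max(deg1[u], deg2[v], 1)
            for u, v in product(g1, g2)}

def as_greedy(sim):
    """Toy AS: repeatedly match the highest-scoring unmatched node pair."""
    alignment, used1, used2 = {}, set(), set()
    for (u, v), _ in sorted(sim.items(), key=lambda kv: -kv[1]):
        if u not in used1 and v not in used2:
            alignment[u] = v
            used1.add(u)
            used2.add(v)
    return alignment

# Combining the NCF of one method with the AS of another is just composition:
g1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
g2 = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
print(as_greedy(ncf_degree(g1, g2)))
```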
We find that MI-GRAAL's NCF is superior to GHOST's NCF, while the performance
of the methods' ASs is data-dependent. Thus, the combination of MI-GRAAL's NCF
and GHOST's AS could be a new superior method for certain data. Also, the
amount of sequence information used within NCF does not affect alignment
quality, while the inclusion of topological information is crucial. Finally,
larger neighborhood sizes are preferred, but often, it is the second largest
size that is superior, and using this size would decrease computational
complexity.
Together, our results give several general recommendations for a fair
evaluation of network alignment methods.
Comment: 19 pages, 10 figures. Presented at the 2014 ISMB Conference, July 13-15, Boston, MA.
Efficient Construction of Probabilistic Tree Embeddings
In this paper we describe an algorithm that embeds a graph metric
$(V, d_G)$ on an undirected weighted graph $G = (V, E)$ into a distribution of
tree metrics $(T, d_T)$ such that for every pair $u, v \in V$,
$d_G(u, v) \le d_T(u, v)$ and $\mathbb{E}[d_T(u, v)] \le O(\log n) \cdot d_G(u, v)$.
Such embeddings have proved highly useful in designing fast approximation
algorithms, as many hard problems on graphs are easy to solve on tree
instances. For a graph with $n$ vertices and $m$ edges, our algorithm runs in
$O(m \log n)$ time with high probability, which improves the previous upper
bound of $O(m \log^3 n)$ shown by Mendel et al. in 2009.
The key component of our algorithm is a new approximate single-source
shortest-path algorithm, which implements the priority queue with a new data
structure, the "bucket-tree structure". The algorithm has three properties: it
only requires linear time in the number of edges in the input graph; the
computed distances have a distance-preserving property; and when computing the
shortest paths to the $k$-nearest vertices from the source, it only needs to
visit these vertices and their edge lists. These properties are essential to
guarantee the correctness and the stated time bound.
Using this shortest-path algorithm, we show how to generate an intermediate
structure, the approximate dominance sequences of the input graph, in
$O(m \log n)$ time, and further propose a simple yet efficient algorithm to
convert this sequence to a tree embedding in $O(n \log n)$ time, both with high
probability. Combining the three subroutines gives the stated time bound of the
algorithm.
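For intuition, the classic FRT construction that such embeddings rely on can be stated compactly: draw a random vertex permutation and a random radius scale, then at each distance scale assign every vertex to the first permuted vertex within the scaled radius; each vertex's chain of cluster assignments defines its leaf-to-root path in the tree. A minimal sequential sketch follows; it runs in roughly $O(n^2 \log \Delta)$ time and does not attempt the paper's $O(m \log n)$ construction.

```python
# A minimal sequential sketch of an FRT-style embedding (random permutation
# plus random radius scale); it does not attempt the paper's O(m log n)
# construction.
import math
import random
from itertools import combinations

def frt_cluster_chains(dist, vertices):
    """dist: dict[(u, v)] -> metric distance. Returns each vertex's chain of
    cluster centers from the coarsest level down to level 0."""
    def d(u, v):
        return 0 if u == v else dist[min(u, v), max(u, v)]

    diameter = max(d(u, v) for u, v in combinations(vertices, 2))
    levels = int(math.ceil(math.log2(diameter))) + 1
    pi = list(vertices)
    random.shuffle(pi)                # random vertex permutation
    beta = random.uniform(1, 2)       # random radius scale
    chains = {}
    for v in vertices:
        # at level i, v joins the first center (in pi order) within beta * 2^i
        chains[v] = tuple(next(c for c in pi if d(v, c) <= beta * 2 ** i)
                          for i in range(levels, -1, -1))
    return chains  # vertices sharing a chain prefix share an FRT-tree ancestor

verts = ["a", "b", "c", "d"]
dist = {("a", "b"): 1, ("a", "c"): 2, ("a", "d"): 4,
        ("b", "c"): 1, ("b", "d"): 3, ("c", "d"): 2}
print(frt_cluster_chains(dist, verts))
```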
Then we show that this efficient construction can facilitate some
applications. We prove that FRT trees (the generated tree embeddings) are
Ramsey partitions with an asymptotically tight bound, so the construction of a
series of distance oracles can be accelerated.
Multi-task Representation Learning for Pure Exploration in Linear Bandits
Despite the recent success of representation learning in sequential decision
making, the study of the pure exploration scenario (i.e., identify the best
option and minimize the sample complexity) is still limited. In this paper, we
study multi-task representation learning for best arm identification in linear
bandits (RepBAI-LB) and best policy identification in contextual linear bandits
(RepBPI-CLB), two popular pure exploration settings with wide applications,
e.g., clinical trials and web content optimization. In these two problems, all
tasks share a common low-dimensional linear representation, and our goal is to
leverage this feature to accelerate the best arm (policy) identification
process for all tasks. For these problems, we design computationally and sample
efficient algorithms DouExpDes and C-DouExpDes, which perform double
experimental designs to plan optimal sample allocations for learning the global
representation. We show that by learning the common representation among tasks,
our sample complexity is significantly better than that of the naive approach
which solves tasks independently. To the best of our knowledge, this is the
first work to demonstrate the benefits of representation learning for
multi-task pure exploration.
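As a concrete point of reference, a standard building block in such pure-exploration algorithms is an experimental design that decides how often to pull each arm before estimating the unknown parameter. The sketch below computes an approximate G-optimal design with Frank-Wolfe updates and then fits a least-squares estimate; it is a generic ingredient, not DouExpDes or C-DouExpDes themselves, and all names and sizes in it are illustrative.

```python
# A minimal sketch of a generic experimental-design ingredient (not DouExpDes
# or C-DouExpDes): an approximate G-optimal design over arm features via
# Frank-Wolfe updates, followed by least-squares estimation.
import numpy as np

def g_optimal_design(X, iters=500):
    """X: (K, d) arm feature matrix. Returns an allocation over the K arms."""
    K, _ = X.shape
    lam = np.full(K, 1.0 / K)
    for t in range(iters):
        A = X.T @ (lam[:, None] * X)               # sum_i lam_i x_i x_i^T
        A_inv = np.linalg.pinv(A)
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)  # x_i^T A^{-1} x_i
        k = int(np.argmax(g))                      # most under-covered arm
        step = 2.0 / (t + 2)                       # Frank-Wolfe step size
        lam = (1 - step) * lam
        lam[k] += step
    return lam

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))          # 20 arms in R^4 (illustrative)
lam = g_optimal_design(X)

# Sample arms according to the design, observe noisy linear rewards,
# and recover the hidden parameter by least squares.
n = 2000
counts = rng.multinomial(n, lam)
theta_true = rng.normal(size=4)
X_samples = np.repeat(X, counts, axis=0)
y = X_samples @ theta_true + rng.normal(scale=0.1, size=len(X_samples))
theta_hat = np.linalg.lstsq(X_samples, y, rcond=None)[0]
print("estimation error:", np.linalg.norm(theta_hat - theta_true))
```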
Deconvolution approach for floating wind turbines
Floating offshore wind turbines (FOWTs), a crucial component of the modern offshore wind energy industry, produce green renewable energy. Accurately evaluating excessive loads while a FOWT operates in adverse weather conditions is a safety concern: under certain sea states, dangerous structural bending moments may cause operational problems. In this study, hydrodynamic ambient wave loads were calculated using the commercial FAST software and converted into FOWT structural loads. This article proposes a Monte Carlo-based engineering technique that is computationally efficient for predicting extreme statistics of either the load or the response process, based on simulations or observations. The novel deconvolution technique is explained in detail. The proposed approach makes efficient use of the entire dataset to produce a simple yet accurate estimate of extreme response values and fatigue life. Extreme values estimated with the new deconvolution approach were compared with the same values produced by the modified Weibull technique. Based on the overall performance of the deconvolution approach under environmental wave loading, it is expected to offer a reliable and accurate forecast of extreme structural loads.
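To make the comparison baseline concrete: a Weibull-style tail method fits a parametric distribution to exceedances of the simulated load process over a high threshold and extrapolates a return level from the fit. The sketch below illustrates that generic idea on synthetic data; it is not the paper's deconvolution method or its exact modified Weibull variant, and the stand-in load process, threshold, and sample sizes are illustrative assumptions.

```python
# A minimal sketch of a Weibull-tail extreme-value baseline on synthetic data;
# not the paper's deconvolution method or its exact modified Weibull variant.
import numpy as np
from scipy.stats import weibull_min

rng = np.random.default_rng(1)
load = rng.gumbel(loc=5.0, scale=1.0, size=100_000)  # stand-in for simulated bending moments

u = np.quantile(load, 0.95)              # high tail threshold
exceed = load[load > u] - u              # exceedances over the threshold
c, _, scale = weibull_min.fit(exceed, floc=0)  # Weibull fit to the tail

# Extrapolate the level exceeded on average once per N samples,
# using the conditional tail distribution above the threshold.
N = 10_000_000
p_exceed = exceed.size / load.size
level = u + weibull_min.ppf(1 - 1 / (N * p_exceed), c, loc=0, scale=scale)
print(f"threshold={u:.2f}, extrapolated 1-in-{N} level={level:.2f}")
```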
Instruction Mining: When Data Mining Meets Large Language Model Finetuning
Large language models (LLMs) are initially pretrained for broad capabilities
and then finetuned with instruction-following datasets to improve their
performance in interacting with humans. Despite advances in finetuning, a
standardized guideline for selecting high-quality datasets to optimize this
process remains elusive. In this paper, we first propose InstructMining, an
innovative method designed for automatically selecting premium
instruction-following data for finetuning LLMs. Specifically, InstructMining
utilizes natural language indicators as a measure of data quality, applying
them to evaluate unseen datasets. During experimentation, we discover that a
double descent phenomenon exists in large language model finetuning. Based on
this observation, we further leverage BlendSearch to help find the best subset
among the entire dataset (i.e., 2,532 out of 100,000). Experiment results show
that InstructMining-7B achieves state-of-the-art performance on two of the most
popular benchmarks: LLM-as-a-judge and Huggingface OpenLLM leaderboard.
Comment: 22 pages, 7 figures.
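The selection loop described above can be pictured as: score every candidate example with cheap quality indicators, rank, then search over how many of the top-ranked examples to keep. The sketch below mimics that pipeline with synthetic indicators, a linear quality estimator, and a plain sweep standing in for BlendSearch; every indicator, weight, and function here is a hypothetical stand-in, not InstructMining's actual rule.

```python
# A minimal sketch of indicator-based data selection with synthetic data;
# the indicators, the linear quality rule, and the plain sweep standing in
# for BlendSearch are all hypothetical, not InstructMining's actual design.
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
# stand-in quality indicators per example (e.g., length, perplexity, reward)
indicators = rng.normal(size=(n, 3))
true_quality = indicators @ np.array([0.5, -1.0, 0.8]) + rng.normal(scale=0.3, size=n)

# fit a linear quality estimator on the indicators, then rank all examples
w = np.linalg.lstsq(indicators, true_quality, rcond=None)[0]
order = np.argsort(-(indicators @ w))     # best-scored examples first

def proxy_finetune_loss(subset_size):
    """Toy stand-in for 'finetune on the top-k and evaluate': higher average
    quality helps, but too small a subset hurts."""
    top = true_quality[order[:subset_size]]
    return -top.mean() + 50.0 / subset_size

# plain sweep over subset sizes where the paper uses BlendSearch
sizes = np.logspace(1.5, 4, 20).astype(int)
best = min(sizes, key=proxy_finetune_loss)
print(f"selected subset size: {best} of {n}")
```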
Efficient Parallel Output-Sensitive Edit Distance
Given two strings $A[1..n]$ and $B[1..m]$, and a set of operations allowed to
edit the strings, the edit distance between $A$ and $B$ is the minimum number
of operations required to transform $A$ into $B$. Sequentially, a standard
Dynamic Programming (DP) algorithm solves edit distance with $\Theta(nm)$ cost.
In many real-world applications, the strings to be compared are similar and
have small edit distances. To achieve highly practical implementations, we
focus on output-sensitive parallel edit-distance algorithms, i.e., to achieve
asymptotically better cost bounds than the standard $\Theta(nm)$ algorithm when
the edit distance is small. We study four algorithms in the paper, including
three algorithms based on Breadth-First Search (BFS) and one algorithm based on
Divide-and-Conquer (DaC). Our BFS-based solution is based on the Landau-Vishkin
algorithm. We implement three different data structures for the longest common
prefix (LCP) queries needed in the algorithm: the classic solution using
parallel suffix array, and two hash-based solutions proposed in this paper. Our
DaC-based solution is inspired by the output-insensitive solution proposed by
Apostolico et al., and we propose a non-trivial adaptation to make it
output-sensitive. All our algorithms have good theoretical guarantees, and they
achieve different tradeoffs between work (total number of operations), span
(longest dependence chain in the computation), and space.
We test and compare our algorithms on both synthetic data and real-world
data. Our BFS-based algorithms outperform the existing parallel edit-distance
implementation in ParlayLib in all test cases. By comparing our algorithms, we
also provide a better understanding of the choice of algorithms for different
input patterns. We believe that our paper is the first systematic study in the
theory and practice of parallel edit distance.
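For readers unfamiliar with the Landau-Vishkin idea underlying the BFS-based algorithms: process candidate distances d = 0, 1, 2, ... and, for each diagonal of the DP table, keep only the furthest-reaching cell, extending it along matching characters with an LCP query. A minimal sequential sketch with a naive linear-scan LCP is below; this is the classic algorithm, not the paper's parallel variants, and the naive scan is exactly what the paper replaces with suffix-array or hash-based structures.

```python
# A minimal sequential sketch of the Landau-Vishkin scheme (the classic
# algorithm, not the paper's parallel variants): round d explores all cells
# of edit distance d, keeping one furthest-reaching row per diagonal and
# extending along matches with an LCP query.
def edit_distance_lv(a: str, b: str) -> int:
    n, m = len(a), len(b)

    def lcp(i, j):  # length of longest common prefix of a[i:] and b[j:]
        k = 0
        while i + k < n and j + k < m and a[i + k] == b[j + k]:
            k += 1
        return k

    NEG = float("-inf")
    L = {0: lcp(0, 0)}  # L[k]: furthest row reached on diagonal k = j - i
    d = 0
    while L.get(m - n, NEG) < n:   # done when diagonal m - n reaches row n
        d += 1
        nxt = {}
        for k in range(-min(d, n), min(d, m) + 1):
            i = max(L.get(k, NEG) + 1,      # substitution (same diagonal)
                    L.get(k - 1, NEG),      # insertion (advance b only)
                    L.get(k + 1, NEG) + 1)  # deletion (advance a only)
            if i < 0:
                continue
            i = min(i, n, m - k)            # clamp to the DP table
            nxt[k] = i + lcp(i, i + k)      # slide along matching characters
        L = nxt
    return d

print(edit_distance_lv("kitten", "sitting"))  # 3
```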
Parallel Longest Increasing Subsequence and van Emde Boas Trees
This paper studies parallel algorithms for the longest increasing subsequence
(LIS) problem. Let $n$ be the input size and $k$ be the LIS length of the
input. Sequentially, LIS is a simple problem that can be solved using dynamic
programming (DP) in $O(n \log n)$ work. However, parallelizing LIS is a
long-standing challenge. We are unaware of any parallel LIS algorithm that has
optimal work and non-trivial parallelism (i.e., $\tilde{O}(k)$ or $o(n)$
span).
This paper proposes a parallel LIS algorithm that costs $O(n \log k)$ work,
$\tilde{O}(k)$ span, and $O(n)$ space, and is much simpler than the previous
parallel LIS algorithms. We also generalize the algorithm to a weighted version
of LIS, which maximizes the weighted sum for all objects in an increasing
subsequence. To achieve a better work bound for the weighted LIS algorithm, we
designed parallel algorithms for the van Emde Boas (vEB) tree, which has the
same structure as the sequential vEB tree, and supports work-efficient parallel
batch insertion, deletion, and range queries.
We also implemented our parallel LIS algorithms. Our implementation is
light-weighted, efficient, and scalable. On input size $10^9$, our LIS
algorithm outperforms a highly-optimized sequential algorithm (with
$O(n \log k)$ cost) on inputs with $k \le 3 \times 10^5$. Our algorithm is also much faster
than the best existing parallel implementation by Shen et al. (2022) on all
input instances.
Comment: To be published in Proceedings of the 35th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '23).
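For context, the sequential $O(n \log k)$ computation that the paper parallelizes can be written in a few lines: maintain, for each length, the smallest possible tail of an increasing subsequence of that length, and place each element with one binary search. A minimal sketch is below; the parallel algorithm and the batch vEB-tree operations are well beyond this and are not shown.

```python
# A minimal sequential sketch of the O(n log k) LIS computation that the
# paper parallelizes. tails[i] is the smallest possible tail of an
# increasing subsequence of length i + 1, so tails stays sorted and each
# element is placed with one binary search over at most k entries.
from bisect import bisect_left

def lis_length(xs):
    tails = []
    for x in xs:
        i = bisect_left(tails, x)   # first tail >= x
        if i == len(tails):
            tails.append(x)         # x extends the longest subsequence
        else:
            tails[i] = x            # x yields a smaller tail for length i + 1
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))  # 4, e.g. (3, 4, 5, 9)
```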